AITopics | error reason

Collaborating Authors

error reason

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

d81cb1f4dc6e13aeb45553f80b3d6837-Paper-Conference.pdf

Neural Information Processing SystemsFeb-18-2026, 08:05:19 GMT

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(13 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

52764eb83bf0a0bd32766ce5c01612e5-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsOct-10-2025, 02:36:59 GMT

gpt-4v, image 1, symbolize, (11 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > New York > Erie County > Amherst (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Security & Privacy (1.00)
Information Technology > Communications (1.00)
(7 more...)

Add feedback

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Liu, Ziqiang, Fang, Feiteng, Feng, Xi, Du, Xinrun, Zhang, Chenhao, Wang, Zekun, Bai, Yuelin, Zhao, Qixuan, Fan, Liyang, Gan, Chengguang, Lin, Hongquan, Li, Jiaming, Ni, Yuansheng, Wu, Haihong, Narsupalli, Yaswanth, Zheng, Zhigang, Li, Chengming, Hu, Xiping, Xu, Ruifeng, Chen, Xiaojun, Yang, Min, Liu, Jiaheng, Liu, Ruibo, Huang, Wenhao, Zhang, Ge, Ni, Shiwen

arXiv.org Artificial IntelligenceJun-11-2024

The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.

gpt-4v, image 1, symbolize, (11 more...)

arXiv.org Artificial Intelligence

2406.05862

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Leisure & Entertainment (1.00)
Law (1.00)
Information Technology > Security & Privacy (1.00)
(6 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators

Ni, Jingwei, Shi, Minjing, Stammbach, Dominik, Sachan, Mrinmaya, Ash, Elliott, Leippold, Markus

arXiv.org Artificial IntelligenceJun-2-2024

With the rise of generative AI, automated fact-checking methods to combat misinformation are becoming more and more important. However, factual claim detection, the first step in a fact-checking pipeline, suffers from two key issues that limit its scalability and generalizability: (1) inconsistency in definitions of the task and what a claim is, and (2) the high cost of manual annotation. To address (1), we review the definitions in related work and propose a unifying definition of factual claims that focuses on verifiability. To address (2), we introduce AFaCTA (Automatic Factual Claim deTection Annotator), a novel framework that assists in the annotation of factual claims with the help of large language models (LLMs). AFaCTA calibrates its annotation confidence with consistency along three predefined reasoning paths. Extensive evaluation and experiments in the domain of political speech reveal that AFaCTA can efficiently assist experts in annotating factual claims and training high-quality classifiers, and can work with or without expert supervision. Our analyses also result in PoliClaim, a comprehensive claim detection dataset spanning diverse political topics.

afacta, error reason, information, (14 more...)

arXiv.org Artificial Intelligence

2402.11073

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Alaska (0.05)
North America > United States > Colorado (0.04)
(18 more...)

Genre: Research Report > Experimental Study (0.68)

Industry:

Media > News (1.00)
Law (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Add feedback

MR-GSM8K: A Meta-Reasoning Revolution in Large Language Model Evaluation

Zeng, Zhongshen, Chen, Pengguang, Liu, Shu, Jiang, Haiyun, Jia, Jiaya

arXiv.org Artificial IntelligenceFeb-6-2024

In this work, we introduce a novel evaluation paradigm for Large Language Models, one that challenges them to engage in meta-reasoning. This approach addresses critical shortcomings in existing math problem-solving benchmarks, traditionally used to evaluate the cognitive capabilities of agents. Our paradigm shifts the focus from result-oriented assessments, which often overlook the reasoning process, to a more holistic evaluation that effectively differentiates the cognitive capabilities among models. For example, in our benchmark, GPT-4 demonstrates a performance five times better than GPT3-5. The significance of this new paradigm lies in its ability to reveal potential cognitive deficiencies in LLMs that current benchmarks, such as GSM8K, fail to uncover due to their saturation and lack of effective differentiation among varying reasoning abilities. Our comprehensive analysis includes several state-of-the-art math models from both open-source and closed-source communities, uncovering fundamental deficiencies in their training and evaluation approaches. This paper not only advocates for a paradigm shift in the assessment of LLMs but also contributes to the ongoing discourse on the trajectory towards Artificial General Intelligence (AGI). By promoting the adoption of meta-reasoning evaluation methods similar to ours, we aim to facilitate a more accurate assessment of the true cognitive abilities of LLMs.

arxiv, benchmark, language model, (15 more...)

arXiv.org Artificial Intelligence

2312.1708

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Industry:

Health & Medicine (1.00)
Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

Zhang, Ge, Du, Xinrun, Chen, Bei, Liang, Yiming, Luo, Tongxu, Zheng, Tianyu, Zhu, Kang, Cheng, Yuyang, Xu, Chunpu, Guo, Shuyue, Zhang, Haoran, Qu, Xingwei, Wang, Junjie, Yuan, Ruibin, Li, Yizhi, Wang, Zekun, Liu, Yudong, Tsai, Yu-Hsuan, Zhang, Fengji, Lin, Chenghua, Huang, Wenhao, Chen, Wenhu, Fu, Jie

arXiv.org Artificial IntelligenceJan-22-2024

As the capabilities of large multimodal models (LMMs) continue to advance, evaluating the performance of LMMs emerges as an increasing need. Additionally, there is an even larger gap in evaluating the advanced knowledge and reasoning abilities of LMMs in non-English contexts such as Chinese. We introduce CMMMU, a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. CMMMU is inspired by and strictly follows the annotation and analysis pattern of MMMU. CMMMU includes 12k manually collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, like its companion, MMMU. These questions span 30 subjects and comprise 39 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU focuses on complex perception and reasoning with domain-specific knowledge in the Chinese context. We evaluate 11 open-source LLMs and one proprietary GPT-4V(ision). Even GPT-4V only achieves accuracies of 42%, indicating a large space for improvement. CMMMU will boost the community to build the next-generation LMMs towards expert artificial intelligence and promote the democratization of LMMs by providing diverse language contexts.

error category, gpt-4v, ground truth, (13 more...)

arXiv.org Artificial Intelligence

2401.11944

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Hong Kong (0.04)
North America > United States (0.04)

Genre: Research Report (0.81)

Industry:

Energy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.92)
Health & Medicine > Therapeutic Area > Immunology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback